ABSTRACT

Recurrent neural networks (RNNs) are a powerful model for
sequential data. End-to-end training methods such as
Connectionist Temporal Classification make it possible to train
RNNs for sequence labelling problems where the input-output
alignment is unknown. The combination of these methods with
the Long Short-term Memory RNN architecture has proved
particularly fruitful, delivering state-of-the-art results in
cursive handwriting recognition. However, RNN performance in
speech recognition has so far been disappointing, with better
results returned by deep feedforward networks. This paper
investigates deep recurrent neural networks, which combine the
multiple levels of representation that have proved so effective
in deep networks with the flexible use of long range context
that empowers RNNs. When trained end-to-end with suitable
regularisation, we find that deep Long Short-term Memory
RNNs achieve a test set error of 17.7% on the TIMIT
phoneme recognition benchmark, which to our knowledge is
the best recorded score.

Index Terms — recurrent neural networks, deep neural 
networks, speech recognition 

1. INTRODUCTION 

Neural networks have a long history in speech recognition,
usually in combination with hidden Markov models [1, 2].
They have gained attention in recent years with the dramatic
improvements in acoustic modelling yielded by deep
feedforward networks [3, 4]. Given that speech is an inherently
dynamic process, it seems natural to consider recurrent
neural networks (RNNs) as an alternative model. HMM-RNN
systems [5] have also seen a recent revival [6, 7], but do not
currently perform as well as deep networks.

Instead of combining RNNs with HMMs, it is possible
to train RNNs 'end-to-end' for speech recognition [8, 9, 10].
This approach exploits the larger state-space and richer
dynamics of RNNs compared to HMMs, and avoids the
problem of using potentially incorrect alignments as training
targets. The combination of Long Short-term Memory [11], an
RNN architecture with an improved memory, with end-to-end
training has proved especially effective for cursive
handwriting recognition [12, 13]. However, it has so far made
little impact on speech recognition.



RNNs are inherently deep in time, since their hidden state
is a function of all previous hidden states. The question that
inspired this paper was whether RNNs could also benefit from
depth in space; that is, from stacking multiple recurrent
hidden layers on top of each other, just as feedforward layers
are stacked in conventional deep networks. To answer this
question we introduce deep Long Short-term Memory RNNs and
assess their potential for speech recognition. We also present
an enhancement to a recently introduced end-to-end learning
method that jointly trains two separate RNNs as acoustic and
linguistic models [10]. Sections 2 and 3 describe the network
architectures and training methods, Section 4 provides
experimental results, and concluding remarks are given in
Section 5.

2. RECURRENT NEURAL NETWORKS 

Given an input sequence $x = (x_1, \ldots, x_T)$, a standard
recurrent neural network (RNN) computes the hidden vector
sequence $h = (h_1, \ldots, h_T)$ and output vector sequence
$y = (y_1, \ldots, y_T)$ by iterating the following equations
from $t = 1$ to $T$:

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{1}$$
$$y_t = W_{hy} h_t + b_y \tag{2}$$

where the $W$ terms denote weight matrices (e.g. $W_{xh}$ is the
input-hidden weight matrix), the $b$ terms denote bias vectors
(e.g. $b_h$ is the hidden bias vector) and $\mathcal{H}$ is the
hidden layer function.
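As an illustration of Eqs. (1) and (2), the recurrence can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used for the experiments: the function name `rnn_forward`, the zero initial hidden state and the default choice of tanh for the hidden layer function are assumptions made for the example.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h, W_hy, b_y, hidden_fn=np.tanh):
    """Iterate Eqs. (1) and (2) over an input sequence x of shape (T, input_size)."""
    h_prev = np.zeros(W_hh.shape[0])  # assumption: the initial hidden state is zero
    hs, ys = [], []
    for x_t in x:
        h_t = hidden_fn(W_xh @ x_t + W_hh @ h_prev + b_h)  # Eq. (1)
        y_t = W_hy @ h_t + b_y                             # Eq. (2)
        hs.append(h_t)
        ys.append(y_t)
        h_prev = h_t
    return np.stack(hs), np.stack(ys)
```

Each hidden vector depends on the previous one, so the loop over time cannot be parallelised; only the matrix products inside each step are vectorised.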

$\mathcal{H}$ is usually an elementwise application of a sigmoid
function. However, we have found that the Long Short-Term
Memory (LSTM) architecture [11], which uses purpose-built
memory cells to store information, is better at finding and
exploiting long range context. Fig. 1 illustrates a single LSTM
memory cell. For the version of LSTM used in this paper [14],
$\mathcal{H}$ is implemented by the following composite function:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{3}$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{4}$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{5}$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{6}$$
$$h_t = o_t \tanh(c_t) \tag{7}$$

where $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$
and $c$ are respectively the input gate, forget gate, output gate
and cell activation vectors, all of which are the same size as the
hidden vector $h$. The weight matrices from the cell to gate
vectors (e.g. $W_{ci}$) are diagonal, so element $m$ in each gate
vector only receives input from element $m$ of the cell vector.

Fig. 1. Long Short-term Memory Cell

One shortcoming of conventional RNNs is that they are
only able to make use of previous context. In speech
recognition, where whole utterances are transcribed at once, there
is no reason not to exploit future context as well.
Bidirectional RNNs (BRNNs) [15] do this by processing the data in
both directions with two separate hidden layers, which are
then fed forwards to the same output layer. As illustrated in
Fig. 2, a BRNN computes the forward hidden sequence
$\overrightarrow{h}$, the backward hidden sequence
$\overleftarrow{h}$ and the output sequence $y$ by iterating the
backward layer from $t = T$ to 1, the forward layer from
$t = 1$ to $T$ and then updating the output layer:

$$\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}) \tag{8}$$
$$\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}) \tag{9}$$
$$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y \tag{10}$$

Combining BRNNs with LSTM gives bidirectional LSTM [16],
which can access long-range context in both input directions.

Fig. 2. Bidirectional RNN

A crucial element of the recent success of hybrid HMM-neural
network systems is the use of deep architectures, which
are able to build up progressively higher level representations
of acoustic data. Deep RNNs can be created by stacking
multiple RNN hidden layers on top of each other, with the
output sequence of one layer forming the input sequence for the
next. Assuming the same hidden layer function is used for
all $N$ layers in the stack, the hidden vector sequences $h^n$ are
iteratively computed from $n = 1$ to $N$ and $t = 1$ to $T$:

$$h^n_t = \mathcal{H}(W_{h^{n-1}h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h) \tag{11}$$

where we define $h^0 = x$. The network outputs $y_t$ are

$$y_t = W_{h^N y} h^N_t + b_y \tag{12}$$

Deep bidirectional RNNs can be implemented by replacing
each hidden sequence $h^n$ with the forward and backward
sequences $\overrightarrow{h}^n$ and $\overleftarrow{h}^n$, and
ensuring that every hidden layer receives input from both the
forward and backward layers at the level below. If LSTM is used
for the hidden layers we get deep bidirectional LSTM, the main
architecture used in this paper. As far as we are aware this is
the first time deep LSTM has been applied to speech recognition,
and we find that it yields a dramatic improvement over
single-layer LSTM.

3. NETWORK TRAINING



We focus on end-to-end training, where RNNs learn to map
directly from acoustic to phonetic sequences. One advantage
of this approach is that it removes the need for a predefined
(and error-prone) alignment to create the training targets. The
first step is to use the network outputs to parameterise a
differentiable distribution $\Pr(y|x)$ over all possible phonetic
output sequences $y$ given an acoustic input sequence $x$. The
log-probability $\log \Pr(z|x)$ of the target output sequence $z$
can then be differentiated with respect to the network weights
using backpropagation through time [17], and the whole
system can be optimised with gradient descent. We now describe
two ways to define the output distribution and hence train the
network. We refer throughout to the length of $x$ as $T$, the
length of $z$ as $U$, and the number of possible phonemes as $K$.

3.1. Connectionist Temporal Classification 

The first method, known as Connectionist Temporal
Classification (CTC) [8, 9], uses a softmax layer to define a
separate output distribution $\Pr(k|t)$ at every step $t$ along the
input sequence. This distribution covers the $K$ phonemes plus
an extra blank symbol $\varnothing$ which represents a non-output
(the softmax layer is therefore size $K + 1$). Intuitively the
network decides whether to emit any label, or no label, at every
timestep. Taken together these decisions define a
distribution over alignments between the input and target sequences.
CTC then uses a forward-backward algorithm to sum over all
possible alignments and determine the normalised probability
$\Pr(z|x)$ of the target sequence given the input sequence [8].
Similar procedures have been used elsewhere in speech and
handwriting recognition to integrate out over possible
segmentations [18, 19]; however CTC differs in that it ignores
segmentation altogether and sums over single-timestep label
decisions instead.
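The sum over alignments can be made concrete with the forward half of the forward-backward algorithm. The following is a minimal NumPy sketch of the standard CTC forward recursion, not the code used for the experiments; the function name `ctc_log_likelihood`, the convention that index 0 is the blank, and the use of log-space arithmetic are assumptions of the example.

```python
import numpy as np

def ctc_log_likelihood(log_probs, target, blank=0):
    """Return log Pr(z|x), summed over all CTC alignments of `target`.

    log_probs: (T, K + 1) array holding log Pr(k|t) for every timestep.
    target:    list of label indices z, none of them equal to `blank`.
    """
    T = log_probs.shape[0]
    # Interleave blanks: z = [a, b] becomes [blank, a, blank, b, blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]              # stay on the same symbol
            if s >= 1:
                terms.append(alpha[t - 1, s - 1])  # advance by one symbol
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]

    if S == 1:  # empty target: only the all-blank alignment is valid
        return alpha[T - 1, 0]
    # Valid alignments end on the last label or on the trailing blank.
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

Working in log space avoids the underflow that products of many small per-timestep probabilities would otherwise cause on long utterances.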

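RNNs trained with CTC are generally bidirectional, to
ensure that every $\Pr(k|t)$ depends on the entire input sequence,
and not just the inputs up to $t$. In this work we focus on deep
bidirectional networks, with $\Pr(k|t)$ defined as follows:

$$y_t = W_{\overrightarrow{h}^N y} \overrightarrow{h}^N_t + W_{\overleftarrow{h}^N y} \overleftarrow{h}^N_t + b_y \tag{13}$$
$$\Pr(k|t) = \frac{\exp(y_t[k])}{\sum_{k'=1}^{K+1} \exp(y_t[k'])} \tag{14}$$

where $y_t[k]$ is the $k^{\text{th}}$ element of the length $K + 1$
unnormalised output vector $y_t$, and $N$ is the number of
bidirectional levels.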



3.2. RNN Transducer 

CTC defines a distribution over phoneme sequences that
depends only on the acoustic input sequence $x$. It is therefore
an acoustic-only model. A recent augmentation, known as an
RNN transducer [10], combines a CTC-like network with a
separate RNN that predicts each phoneme given the previous
ones, thereby yielding a jointly trained acoustic and language
model. Joint LM-acoustic training has proved beneficial in
the past for speech recognition [20, 21].

Whereas CTC determines an output distribution at every
input timestep, an RNN transducer determines a separate
distribution $\Pr(k|t, u)$ for every combination of input timestep
$t$ and output timestep $u$. As with CTC, each distribution
covers the $K$ phonemes plus $\varnothing$. Intuitively the network
'decides' what to output depending both on where it is in the
input sequence and the outputs it has already emitted. For a
length $U$ target sequence $z$, the complete set of $TU$ decisions
jointly determines a distribution over all possible alignments
between $x$ and $z$, which can then be integrated out with a
forward-backward algorithm to determine $\log \Pr(z|x)$ [10].

In the original formulation $\Pr(k|t, u)$ was defined by
taking an 'acoustic' distribution $\Pr(k|t)$ from the CTC network,
a 'linguistic' distribution $\Pr(k|u)$ from the prediction
network, then multiplying the two together and renormalising.
An improvement introduced in this paper is to instead feed
the hidden activations of both networks into a separate
feedforward output network, whose outputs are then normalised
with a softmax function to yield $\Pr(k|t, u)$. This allows a
richer set of possibilities for combining linguistic and
acoustic information, and appears to lead to better generalisation.
In particular we have found that the number of deletion errors
encountered during decoding is reduced.



Denote by $\overrightarrow{h}^N$ and $\overleftarrow{h}^N$ the
uppermost forward and backward hidden sequences of the CTC
network, and by $p$ the hidden sequence of the prediction
network. At each $t, u$ the output network is implemented by
feeding $\overrightarrow{h}^N$ and $\overleftarrow{h}^N$ to a linear
layer to generate the vector $l_t$, then feeding $l_t$ and $p_u$
to a tanh hidden layer to yield $h_{t,u}$, and finally feeding
$h_{t,u}$ to a size $K + 1$ softmax layer to determine
$\Pr(k|t, u)$:

$$l_t = W_{\overrightarrow{h}^N l} \overrightarrow{h}^N_t + W_{\overleftarrow{h}^N l} \overleftarrow{h}^N_t + b_l \tag{15}$$
$$h_{t,u} = \tanh(W_{lh} l_t + W_{ph} p_u + b_h) \tag{16}$$
$$y_{t,u} = W_{hy} h_{t,u} + b_y \tag{17}$$
$$\Pr(k|t, u) = \frac{\exp(y_{t,u}[k])}{\sum_{k'=1}^{K+1} \exp(y_{t,u}[k'])} \tag{18}$$

where $y_{t,u}[k]$ is the $k^{\text{th}}$ element of the length
$K + 1$ unnormalised output vector. For simplicity we constrained
all non-output layers to be the same size
($|\overrightarrow{h}^n_t| = |\overleftarrow{h}^n_t| = |p_u| = |l_t| = |h_{t,u}|$);
however they could be varied independently.
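The output network of Eqs. (15)-(18) can be evaluated for every $(t, u)$ pair at once using broadcasting. The sketch below is illustrative only: the function name, the dictionary keys (`W_fl`, `W_bl`, `W_lh`, `W_ph` and so on) and the array layout are assumptions made for the example, and the max-subtraction is a standard numerical safeguard rather than part of the model.

```python
import numpy as np

def transducer_output_distribution(h_fwd, h_bwd, p, params):
    """Compute Pr(k|t,u) for all t and u, following Eqs. (15)-(18).

    h_fwd, h_bwd: (T, H) uppermost forward / backward hidden sequences.
    p:            (U1, H) hidden sequence of the prediction network.
    params:       dict of weights; the key names are hypothetical.
    Returns an array of shape (T, U1, K + 1).
    """
    # Eq. (15): linear layer combining both directions, shape (T, H).
    l = h_fwd @ params["W_fl"].T + h_bwd @ params["W_bl"].T + params["b_l"]
    # Eq. (16): broadcast so each (t, u) pair gets its own hidden vector.
    h = np.tanh((l @ params["W_lh"].T)[:, None, :]
                + (p @ params["W_ph"].T)[None, :, :]
                + params["b_h"])
    # Eq. (17): unnormalised outputs, shape (T, U1, K + 1).
    y = h @ params["W_hy"].T + params["b_y"]
    # Eq. (18): softmax over the last axis, shifted for numerical stability.
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Because the acoustic and linguistic contributions are summed before the tanh nonlinearity, the two projections are computed once per $t$ and once per $u$ respectively, and only the cheap addition is repeated for each pair.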

RNN transducers can be trained from random initial
weights. However, they appear to work better when initialised
with the weights of a pretrained CTC network and a
pretrained next-step prediction network (so that only the output
network starts from random weights). The output layers (and
all associated weights) used by the networks during
pretraining are removed during retraining. In this work we pretrain
the prediction network on the phonetic transcriptions of the
audio training data; however for large-scale applications it
would make more sense to pretrain on a separate text corpus.

3.3. Decoding 

RNN transducers can be decoded with beam search [10] to
yield an n-best list of candidate transcriptions. In the past
CTC networks have been decoded using either a form of
best-first decoding known as prefix search, or by simply taking the
most active output at every timestep [8]. In this work however
we exploit the same beam search as the transducer, with the
modification that the output label probabilities $\Pr(k|t, u)$ do
not depend on the previous outputs (so $\Pr(k|t, u) = \Pr(k|t)$).
We find beam search both faster and more effective than
prefix search for CTC. Note the n-best list from the transducer
was originally sorted by the length normalised log-probability
$\log \Pr(y)/|y|$; in the current work we dispense with the
normalisation (which only helps when there are many more
deletions than insertions) and sort by $\Pr(y)$.
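For comparison, the simplest of the earlier CTC decoding strategies mentioned above, taking the most active output at every timestep, can be sketched as follows. This is not the beam search used in this work; the function name `ctc_best_path` and the convention that index 0 is the blank are assumptions of the example.

```python
def ctc_best_path(frame_probs, blank=0):
    """Pick the most active output per timestep, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for k in best:
        # A label is emitted only when it differs from the previous frame's
        # choice, so a repeated phoneme needs a blank in between.
        if k != prev and k != blank:
            decoded.append(k)
        prev = k
    return decoded
```

For example, per-frame choices of 1, 1, blank, 1, 2, 2, blank decode to the label sequence 1, 1, 2.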

3.4. Regularisation 

Regularisation is vital for good performance with RNNs, as
their flexibility makes them prone to overfitting. Two
regularisers were used in this paper: early stopping and weight
noise (the addition of Gaussian noise to the network weights
during training [22]). Weight noise was added once per
training sequence, rather than at every timestep. Weight noise
tends to 'simplify' neural networks, in the sense of reducing
the amount of information required to transmit the
parameters [23, 24], which improves generalisation.
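A minimal sketch of per-sequence weight noise is given below. It assumes plain stochastic gradient descent without momentum and a user-supplied gradient function; `noisy_weights`, `train_epoch` and `grad_fn` are hypothetical names introduced for the example, and applying the gradient computed at the noisy weights to the clean weights is the usual formulation rather than something spelled out in the text.

```python
import numpy as np

def noisy_weights(weights, sigma, rng):
    """Return a noisy copy of every weight array (additive N(0, sigma^2) noise)."""
    return {name: w + rng.normal(0.0, sigma, size=w.shape)
            for name, w in weights.items()}

def train_epoch(weights, sequences, grad_fn, learning_rate, sigma, rng):
    """One pass over the data with weight noise sampled once per sequence.

    grad_fn(weights, x, z) is a hypothetical callable returning a dict of
    gradients of the loss for one (input, target) pair.
    """
    for x, z in sequences:
        noisy = noisy_weights(weights, sigma, rng)  # one sample per sequence
        grads = grad_fn(noisy, x, z)                # gradient at the noisy weights
        for name in weights:
            weights[name] -= learning_rate * grads[name]  # update clean weights
    return weights
```

Sampling the noise once per sequence keeps the perturbed network fixed across all timesteps of that sequence, which is the behaviour described above.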

4. EXPERIMENTS 

Phoneme recognition experiments were performed on the
TIMIT corpus [25]. The standard 462 speaker set with all
SA records removed was used for training, and a separate
development set of 50 speakers was used for early
stopping. Results are reported for the 24-speaker core test set.
The audio data were encoded using a Fourier-transform-based
filter-bank with 40 coefficients (plus energy) distributed on
a mel-scale, together with their first and second temporal
derivatives. Each input vector was therefore size 123. The
data were normalised so that every element of the input
vectors had zero mean and unit variance over the training set. All
61 phoneme labels were used during training and decoding
(so $K = 61$), then mapped to 39 classes for scoring [26].
Note that all experiments were run only once, so the
variance due to random weight initialisation and weight noise is
unknown.
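The input size and the normalisation step can be sketched as follows. The arithmetic for the input size follows directly from the description above; the helper names `fit_normaliser` and `normalise`, and the representation of each utterance as a (frames, 123) array, are assumptions of the example.

```python
import numpy as np

N_FILTERBANK = 40
N_STATIC = N_FILTERBANK + 1  # 40 filter-bank coefficients plus energy
INPUT_SIZE = 3 * N_STATIC    # static + first + second derivatives = 123

def fit_normaliser(train_utterances):
    """Per-element mean and standard deviation over all training frames."""
    frames = np.concatenate(train_utterances, axis=0)
    return frames.mean(axis=0), frames.std(axis=0)

def normalise(utterance, mean, std):
    """Apply the training-set statistics to one utterance of shape (T, 123)."""
    return (utterance - mean) / std
```

The statistics are estimated on the training set only and then reused unchanged for the development and test data.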

As shown in Table 1, nine RNNs were evaluated, varying
along three main dimensions: the training method used
(CTC, Transducer or pretrained Transducer), the number of
hidden levels (1-5), and the number of LSTM cells in each
hidden layer. Bidirectional LSTM was used for all networks
except CTC-3l-500h-tanh, which had tanh units instead of
LSTM cells, and CTC-3l-421h-uni where the LSTM layers
were unidirectional. All networks were trained using
stochastic gradient descent, with learning rate $10^{-4}$, momentum 0.9
and random initial weights drawn uniformly from [-0.1, 0.1].
All networks except CTC-3l-500h-tanh and PreTrans-3l-250h
were first trained with no noise and then, starting from the
point of highest log-probability on the development set,
retrained with Gaussian weight noise ($\sigma = 0.075$) until the
point of lowest phoneme error rate on the development set.
PreTrans-3l-250h was initialised with the weights of
CTC-3l-250h, along with the weights of a phoneme prediction
network (which also had a hidden layer of 250 LSTM cells), both
of which were trained without noise, retrained with noise, and
stopped at the point of highest log-probability.
PreTrans-3l-250h was trained from this point with noise added.
CTC-3l-500h-tanh was entirely trained without weight noise because
it failed to learn with noise added. Beam search decoding was
used for all networks, with a beam width of 100.

Table 1. TIMIT Phoneme Recognition Results. 'Epochs' is
the number of passes through the training set before
convergence. 'PER' is the phoneme error rate on the core test set.

Network            Weights   Epochs   PER
CTC-3l-500h-tanh   3.7M      107      37.6%
CTC-1l-250h        0.8M      82       23.9%
CTC-1l-622h        3.8M      87       23.0%
CTC-2l-250h        2.3M      55       21.0%
CTC-3l-421h-uni    3.8M      115      19.6%
CTC-3l-250h        3.8M      124      18.6%
CTC-5l-250h        6.8M      150      18.4%
Trans-3l-250h      4.3M      112      18.3%
PreTrans-3l-250h   4.3M      144      17.7%

Fig. 3. Input Sensitivity of a deep CTC RNN. The heatmap
(top) shows the derivatives of the 'ah' and 'p' outputs printed
in red with respect to the filterbank inputs (bottom). The
TIMIT ground truth segmentation is shown below. Note that
the sensitivity extends to surrounding segments; this may be
because CTC (which lacks an explicit language model)
attempts to learn linguistic dependencies from the acoustic data.

The advantage of deep networks is immediately obvious,
with the error rate for CTC dropping from 23.9% to 18.4%
as the number of hidden levels increases from one to five.
The four networks CTC-3l-500h-tanh, CTC-1l-622h,
CTC-3l-421h-uni and CTC-3l-250h all had approximately the same
number of weights, but give radically different results. The
three main conclusions we can draw from this are (a) LSTM
works much better than tanh for this task, (b) bidirectional
LSTM has a slight advantage over unidirectional LSTM and
(c) depth is more important than layer size (which supports
previous findings for deep networks [3]). Although the
advantage of the transducer is slight when the weights are randomly
initialised, it becomes more substantial when pretraining is
used.
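The weight counts in Table 1 can be approximately reproduced with a short calculation. The counting convention below is an assumption, not something stated in the text: each LSTM layer has full input and recurrent weights into the three gates and the cell input, three diagonal cell-to-gate vectors and one bias per unit; each bidirectional level has a forward and a backward layer fed by both directions of the level below; and the output layer has $K + 1 = 62$ units. The function names are hypothetical.

```python
def lstm_layer_weights(n_in, n_cells):
    """Weights in one unidirectional LSTM layer under the stated assumptions."""
    gates = 4 * n_cells  # input gate, forget gate, cell input, output gate
    return (n_in * gates       # input-to-hidden weights
            + n_cells * gates  # recurrent hidden-to-hidden weights
            + 3 * n_cells      # diagonal cell-to-gate (peephole) weights
            + gates)           # biases

def ctc_blstm_weights(n_layers, n_cells, n_inputs=123, n_outputs=62):
    """Total weights in a deep bidirectional LSTM with a CTC softmax layer."""
    total, n_in = 0, n_inputs
    for _ in range(n_layers):
        total += 2 * lstm_layer_weights(n_in, n_cells)  # forward + backward
        n_in = 2 * n_cells  # next level sees both directions
    return total + n_in * n_outputs + n_outputs  # output weights and biases
```

Under these assumptions the totals for the bidirectional LSTM CTC networks round to the figures in Table 1; for instance, three levels of 250 cells give about 3.8M weights, close to the single level of 622 cells, which is what makes the depth versus layer size comparison possible.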

5. CONCLUSIONS AND FUTURE WORK 

We have shown that the combination of deep, bidirectional
Long Short-term Memory RNNs with end-to-end training and
weight noise gives state-of-the-art results in phoneme
recognition on the TIMIT database. An obvious next step is to
extend the system to large vocabulary speech recognition.
Another interesting direction would be to combine
frequency-domain convolutional neural networks [27] with deep LSTM.